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Abstract 

We present a new record on computing specific bits of tt, the mathematical constant, 
and discuss performing such computations on Apache Hadoop clusters. The specific bits 
represented in hexadecimal are 

0E6C1294 AED40403 F56D2D76 4026265B CA98511D 0FCFFAA1 0F4D28B1 BB5392B8. 

These 256 bits end at the 2,000,000,000,000,252 nd bit position^] which doubles the 
position and quadruples the precision of the previous known record |12j . The position of 
the first bit is 1, 999, 999, 999, 999, 997 and the value of the two quadrillionth bit is 0. 

The computation is carried out by a MapReduce program called DistBbp. To effec- 
tively utilize available cluster resources without monopolizing the whole cluster, we de- 
velop an elastic computation framework that automatically schedules computation slices, 
each a DistBbp job, as either map-side or reduce-side computation based on changing 
cluster load condition. We have calculated tt at varying bit positions and precisions, and 
one of the largest computations took 23 days of wall clock time and 503 years of CPU 
time on a 1000-node cluster. 

1 Introduction 

The computation of the mathematical constant tt has drawn a great attention from math- 
ematicians and computer scientists over the centuries [U [10]. The computation of tt also 
serves as a vehicle for testing and benchmarking computer systems. There are two types of 
challenges, 

(i) computing the first n digits of tt, and 

(ii) computing only the n th bit of tt. 

In this paper, we discuss our experience on computing the n th bit of tt with Apache Hadoop 
(http://hadoop.apache.org), an open-source distributed computing software. To the best 
of our knowledge, the result obtained by us, the Yahoo! Cloud Computing Team, is a new 
world record as this paper being written. 

' When tt is represented in binary, we have tt = 11.0010 0100 0011 1111 . . . Bit position is counted starting 
after the radix point. For example, the eight bits starting at the ninth bit position are 0011 1111 in binary or, 
equivalently, 3F in hexadecimal. 



In 1996, Bailey, Borwein and Plouffe discovered a new formula (equation ( 1.1 )) to compute 
7r, which is now called the BBP formula [I], 
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The remarkable discovery leads to the first digit-extraction algorithm for tt in base 2. In 
other words, it allows computing the n th bit of tt without computing the earlier bits. Soon 
after, Bellard has discovered a faster BBP-type formula [3], 
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He computed 152 bits of tt ending at the 1,000,000,000, 151 st bit position in 1997 gj. The 
computation took 12 days with more than 20 workstations and 180 days of CPU time. In 1998, 
Percival started a distributed computing project called PiHex to calculate the five trillionth 
bit, the forty trillionth bit and the quadrillionth bit of tt [12] . The best result obtained was 
64 bits of tt ending at the 1, 000, 000, 000, 000, 060 th position in 2000. The entire calculation 
took two years and required 250 CPU years, using idle time slices of 1734 machines in 56 
countries. The "average" computer participating was a 450 MHz Pentium II. For a survey 
on tt computations, see [5]. 

The remainder of the paper is organized as follows. The results are presented in next 
section. We discuss the BBP digit-extraction algorithm and our implementation in Section 
[3] and Section [IJ respectively. 



2 Results 



We have developed a program called DistBbp, which uses equation (1.2) to compute the n th 
bit of tt with arbitrary precision arithmetic. DistBbp employs the MapReduce programming 
model [5] and runs on Hadoop clusters. It has been used to compute 256 bits of tt around 



the two quadrillionth bit position as shown in Table 2.1 This is a new record, which doubles 
the position and quadruples the precision of the previous record obtained by PiHex. 



Bit Position n 


Bits of tt Starting at The n th Bit Position 
(in Hexadecimal) 


1,999,999,999,999,997 


0E6C1294 AED40403 F56D2D76 4026265B CA98511D 
0FCFFAA1 0F4D28B1 BB5392B8 

(256 bits) 



Table 2.1: The 256 bits of vr starting at the (2 
(2 • 10 15 + 252) nd bit position. 
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3) th bit position and ending at the 



We have also computed the first one billion bits and the bits at positions n = 10 m + 1 
for m < 15. Table 2fl_ below shows the results for 13 < m < 15. The results for n < 10 13 
are omitted since the corresponding bit values are well-known. It appears that the results 
for 13 < m < 14 are new, although their computation requirements are not as heavy as 
the one for m = 15. The result for m = 15 is similar to the one obtained by PiHex except 
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that the starting positions are slightly different and our result has a longer bit sequence. 
These computations were executed on the idle slices of the Hadoop clusters in Yahoo!. The 
cluster sizes range from 1000 to 4000 machines. Each machine has two quad-core CPUs with 
clock speed ranging from 1.8 GHz to 2.5 GHz. We have run at least two computations at 
different bit positions, usually n and n — 4, for each row in Tables 2.1 and 2.2 Only the bit 



values covered by two computations are considered as valid results in order to detect machine 



errors and transmission errors. Table 2.3 below shows the running time information for some 
computations. 



m 


Bit Position n = 10 m + 1 


Bits of 7r Starting at The n th Bit Position 
(in Hexadecimal) 


13 


10,000,000,000,001 


896DC3D3 


6A09E2E9 


29CA6F91 


66FBA8DC 


F000C4A6 






4C78723F 


814F2EB4 


6D417E5A 


4337FB1C 


C2EB474F 






74CCD953 


94FB7045 


3F7B48AE 


E758BDD2 


DD7B1371 






0CDB80EF 


72B70912 


E20281FC 


76FD0A10 


CDE2ADD8 






BD5163E1 


FC582BFE 


FB4D8F9A 


F4A771E8 


BA9F0B58 






C0334D55 


297ADDEB 


1DACB0EF 


B572D927 


DBDDB68D 






858929EA 


D8 












(1000 bits) 








14 


100,000,000,000,001 


C216EC69 


7A098CC4 


B9AF60D0 


5AE28EA9 


36873682 






D062B83B 


52C5C205 


CDA35F4D 


BCD0E9C3 


785CBFA7 






E62401AB 


B69AF82C 


CE885230 


03D4FC01 


7C620B11 






A94B99F7 


4DDE5102 


A5142280 


46B0055A 


636715D3 






75CB8BAC 


2003BA93 


27B731EA 


40341861 


27419284 






E3FFE685 


480637BF 


5C5BAE91 


3AFB7EA7 


45B4C955 






8E2EB177 














(992 bits) 










15 


1,000,000,000,000,001 


6216B069 
D95CC79A 

(228 bits) 


CB6C1D36 
C84E60D2 


117099E4 
3 


3646A6D4 


48D887CC 



Table 2.2: The bits of vr starting at the (10 m + l) st positions for 13 < m < 15. 



3 The BBP Digit-Extraction Algorithm 

We briefly describe the BBP digit-extraction algorithm in this section (see [1] for more de- 
tails). Any BBP-type formula, such as equation (1.1) or equation (1.2), can be used in the 
algorithm. For simplicity, we discuss the algorithm with equation (1.1) in this section. 
In order to obtain the (n + l) th bit, compute {2 n 7r}, where 



{x\ = x 



denotes the fraction part of x. By equation (j 1 . 1 p , we have 
{2"vr} = 



n— 1— 4ft 




4k + 3 



(3.1) 
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Starting Bit Position 


Precision (bits) 


Time Usecj^j 



CPlQTime 


Date Completed 


l 


cnn nm nnn 
oUU,UUi,UUU 


10 days 


19 years 


Tnn^ oq om n 
June zo, zUlU 


cnn nnn nm 
oUU,UUU,UUl 


onn nm nnn 
ZUU,UUi,UUU 


3 days 


8 years 


June zz, zUlU 


999,999,997 


1024 


102 seconds 


51 minutes 


June 10, 2010 


1,000,000,001 


1024 


96 seconds 


54 minutes 


June 11, 2010 


9,999,999,997 


1024 


2 minutes 


21 hours 


June 8, 2010 


10,000,000,001 


1024 


4 minutes 


21 hours 


June 8, 2010 


99,999,999,997 


1024 


10 minutes 


12 days 


June 6, 2010 


100,000,000,001 


1024 


9 minutes 


11 days 


June 6, 2010 


999,999,999,997 


1024 


105 minutes 


121 days 


June 7, 2010 


1,000,000,000,001 


1024 


98 minutes 


121 days 


June 7, 2010 


9,999,999,999,997 


1024 


10 hours 


4 years 


June 2, 2010 


10,000,000,000,001 


1024 


8 hours 


4 years 


June 1, 2010 


99,999,999,999,997 


1024 


4 days 


37 years 


June 11, 2010 


100,000,000,000,001 


1024 


5 days 


40 years 


June 7, 2010 


999,999,999,999,993 


288 


13 days 


248 years 


July 2, 2010 


1,000,000,000,000,001 


256 


25 days 


283 years 


July 6, 2010 


1,999,999,999,999,993 


288 


23 days 


582 years 


July 29, 2010 


1,999,999,999,999,997 


288 


23 days 


503 years 


July 25, 2010 



Table 2.3: The running time used in each computation. 



" Note that Time Used is not equivalent to "cluster time" since there were other jobs running on the cluster. 
b Note that the CPUs in a cluster may be slightly different. 



Then, evaluate each sum as below, 




(3.2) 



where 



A, 



def 2 n+x ~ 4k mod (yk + z) 
yk + z 



and 



B, 



def 



1 



2 4k - n ~ x (yk + z) 



(3.3) 



The number of terms in the first sum of equation (3.2) is linear to n. Each term is a modular 
exponentiation followed by a floating point division. In the second sum, it is only required 
to evaluate the first 0(p/logp) terms such that > e = 2~ p ~ 1 (see equation (3.6)) when 
working on p-bit precision. Each term is a reciprocal computation. For all the terms in both 
sums, all the operands are integers with O(logn) bits. 
The running time of the algorithm is 

0{p{n 1+e +p)) (3.4) 

bit ope ratio ns for any e > 0. Note that p is required to be O(logn) due to rounding error; see 
Section 



3.1 



When the algorithm is used to compute the n th bit with a small p, the running 
time is essentially linear in n. However, when the algorithm is used to compute the first p 
bits (i.e. n = 0), the running time is quadratic in p. In this case, there are faster algorithms 
[51 113] and [7], which run in essentially linear time. 
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It is easy to see that the required space for the BBP algorithm is 

0(p + \ogn) (3.5) 

bits. For p small, the computation task is CPU-intensive but not data-intensive. 

The algorithm is embarrassingly parallel because it mainly evaluates summations with a 
large number of terms. Evaluating these summations can be computed in parallel with little 
additional overhead. 



3.1 Rounding Error 

Since the outputs of the BBP algorithm are the exact bits of tt, it is important to under- 
stand the rounding errors that arose in the computation and how they impact the results. 
One simple way for diminishing the rounding error effect is to increase the precision in the 
computation. In practice, at least two independent computations at different bit positions, 
usually n and n— 4, are performed in order to verify the results. For example, the bit sequence 



shown in Table 2.1 was obtained by two computations shown at the last two rows of Table 
2.3 We discuss rounding error in more details in the rest of the section. 

When a real number is represented in p-h\t precision, the absolute relative rounding error 
is bounded above by 

e = 0.5 ulp = 2~ p ~ 1 , (3.6) 

where ulp is the unit in the last place [9] . For computing the n th bit of ix with precision p, the 
number of terms in the summations is m = 0(n + p). The cumulative absolute relative error 
is bounded above by me. For example, when computing the (10 15 ) th bit of tt with IEEE 754 



64-bit floating point, we have m ~ 7 • 10 (see equation (1.2)), p = 52 and e = 2 . Then, 



me ~ 0.0777 > 2- 4 , which means that even the third bit may be incorrect due to rounding 
error. In practice, around 28 bits are calculated correctly in this case. 

The long correct bit sequence can be explained by analyzing rounding errors with a 
probability model as follows. Let e^ be the error in the k th term and E = ^ e^ be the error 
of the sum. Suppose each e^ follows a uniform distribution over the closed interval [— e, e], 

£fc ~ U(-e,e). 

Then, E follows a uniform sum distribution (a.k.a. Irwin-Hall distribution) with mean and 
variance a 2 = me 2 /3. The random variables e^'s are independent, identically distributed and 
m is large. By the Central Limit Theorem, the sum distribution can be approximated by a 
normal distribution with the same mean and variance, i.e. 

E ~ N(0,me 2 /3). 



For m « 7 • 10 14 and e = 2" 53 , we have 72.79% confidence of \E\ < 2" 29 , 97.20% confidence 
of \E\ < 2~ 28 and 99.999 989% confidence of \E\ < 2~ 27 . 

Note that \E\ < 2~ b ~ 1 does not imply b correct bits because it is possible to have con- 
secutive 0's or l's affected by the error. For example, we have used 64-bit floating point to 
compute bits starting at the (10 15 + 53) rd position and obtained the following 52 bits. 



Position 
Hex 

Binary 



53 57 61 65 69 73 77 81 85 89 93 97 101 
D36116FA8581A 
1101 0011 0110 0001 0001 0110 1111 1010 1000 0101 1000 0001 1010 
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The corresponding true bit values are shown below. 

Position: 53 57 61 65 69 73 77 81 85 89 93 97 101 
Hex :D36117099E436 
Binary : 1101 0011 0110 0001 0001 0111 0000 1001 1001 1110 0100 0011 0110 

We have 2 -29 < \E\ < 2~ 28 but only the first 23 bits are correct due to the rounding error 
at the last of the four consecutive 0's in the true bit values. 



4 MapReduce 

In this section, we discuss our Hadoop MapReduce implementation of the BBP algorithm. 
For computing the bits of tt starting at position n with precision p, the algorithm basically 
evaluates the sum 



where each term Tj consists of a few arithmetic operations; see Section (4.2). We consider 
p is small, i.e. p = O(logn), throughout this section. Then, the size of the index set I is 
roughly 0.7n; see equation (1.2). For n = 10 15 , the size of / is approximately 7 • 10 14 . 



A straightforward approach is to partition the index set / into m pairwise disjoint sets 
111 " j I-m ' Then, evaluate each summation aj = f Yliei- by a ma PPer and compute the 
final sum S = X^i<j<m a j by a reducer. However, such implementation, which mainly relies 
on map-side computation, does not utilize a cluster because a cluster usually has a fixed ratio 
between map and reduce task capacities. Most of the reduce slots are not used in this case. 



The second problem is that the MapReduce job possibly runs for a long time; see Table 2.3 It 
is desirable to have a mechanism to persist the intermediate results, so that the computation 
is interruptible and resumable. 

In our design, we have multi-level partitioning. As before, the summation is first par- 
titioned into m smaller summations "Ej = Yliei T% such that the value of m is also small. 
Each Ej is computed by an individual MapReduce job. A controller program executed on 
a gateway machine is responsible for submitting these jobs. The summations Ej are further 
partitioned into tiny summations (jj t k = Yli^i where {Ij t i, • • • ,Ij,mA 1S a partition of 
Ij. Each Xj job has rrij tiny summations, which can be computed on either the map-side or 
the reduce-side; see Section [4. 1| below. Each tiny summation task is then assigned to a node 
machine by the MapReduce system. In the task level, if there are more than one available 
CPU cores in the node machine, the tiny summation is partitioned again so that each part is 
executed by a separated thread. The task outputs (T.- fc's are written to HDFS, a persistent 
storage of the Hadoop cluster [T3]. Then, the controller program reads (Tj k' s from HDFS, 
compute Sj = J2i<k<m a j,k an d write Ej back to HDFS. These intermediate results are 
persisted in HDFS. Therefore, the computation can possibly be interrupted at any time by 
killing the controller program and all the running jobs, and then be resumed in the future. 
The final sum S = J2i<j< m can be efficiently computed because m is relatively small. 
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The multi-level partitioning is summarized below 

Final Sum: 



Jobs: 

Tasks: 

Threads: 



S= £ S, 

l<j<m 



Kk<rrii 



a 3,k - ^ S i' k t 
l<t<mj j. 



4.1 Map-side & Reduce-side Computations 

In order to utilize the cluster resources, we have developed a general framework to execute 
computation tasks on either the map-side or the reduce-side. Applications only have to define 
two functions: 

1. partition(c, m): partition the computation c into m parts c\, . . . , c m ; 

2. compute(c): execute the computation c. 

A map-side job contains multiple mappers and zero reducers. The input computation c 
is partitioned into m parts by a Partition I nputFormat and then each part is executed by a 



mapper. See Figure [47T| below. 
Input 



Mapper 



Output 



c, m 


PartitionlnputFormat 




Mapper 


Output 


> 







Mapper 



Output 



Figure 4.1: Map-side computation. 

In contrast, a reduce-side job contains a single mapper and multiple reducers. A Single- 
tonlnputFormat is used to launch a single PartitionMapper, which is responsible to partition 
the computation c into m parts. Then, the parts are forwarded to an Indexer, which creates 



indexes for launching m reducers. See Figure 4.2 below. 
Input 



Reducer 



Output 



c, m 
> 


SingletonlnputFormat 


c, m 
— > 


PartitionMapper 


c v ,.c 

— > 


Indexer 




Reducer 


Output 













lOutput 

Reducer > 



Figure 4.2: Reduce-side computation. 
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Note that the utility classes PartitionlnputFormat, Mapper, SingletonlnputFormat, Partition- 
Mapper, Indexer and Reducer are provided by our framework. For map-side (or reduce-side) 
jobs, the user defined functions partition(c, m) and compute(c) are executed in Partitionlnput- 
Format (or PartitionMapper) and Mapper (or Reducer), respectively. 

The map and reduce task slots in a Hadoop cluster are statically configured. This frame- 
work allows computations utilizing both map and reduce task slots. 



4.2 Evaluating The Terms 



As shown in equations (3.2) and (3.3), there are two types of terms, and B^, in the 
summations of the BBP algorithm. The terms A^ involve a modular exponentiation and a 
floating point division. Modular exponentiation can be evaluated by the successive squaring 
method. When the modulus is large, we use Montgomery method which is faster than 
successive squaring in this case. 

Floating point division is implemented with arbitrary precision because of the rounding 
error issue discussed in Section 3.1 For the terms in equation (3.3), the division first is 
done first by shifting b = Ak — n — x bits and then followed by floating point division with 
(p — 6)-bit precision, where p is the selected precision. 



4.3 Utilizing Cluster Idle Slices 

One of our goals is to utilize the idle slices in a cluster. The controller program mentioned 
previously also monitors the cluster status. When there are sufficient available map (or 
reduce) slots, the controller program submits a map-side (or reduce-side) job. Each job is 
small so that it holds cluster resource only for a short period of time. 



In one of our computations (see the last row in Table 2.3), we had n = 2 • 10 15 — 3 and 
p = 288. The summation had approximately 1.4- 10 15 terms. It was executed in a 1000-node 
cluster. Each node had two quad-core CPUs with clock speed ranging from 2.0 GHz to 2.5 
GHz, and was configured to support four map tasks and two reduce tasks. The computation 
was divided into 35,000 jobs. Depending on the cluster load condition, the controller program 
might submit up to 60 concurrent jobs at any time. A job had 200 mappers with one thread 
each or 100 reducers with two threads each. Each thread computed a summation with roughly 
200,000,000 terms and took around 45 minutes. The entire computation took 23 days of real 
time and 503 years of CPU time. 



5 Conclusions & Future Works 

In this paper, we present our latest results on computing tt using Apache Hadoop. We extend 
the previous record of calculating specific bits of tt from position around the one quadrillionth 
bit to position around the two quadrillionth bit, and from 64-bit precision to 256-bit precision. 
The distributed computation is done through a MapReduce program called DistBbp. Our 
elastic computation framework automatically schedules computation slices as either map-side 
or reduce-side computation to fully exploit idle cluster resources. 

A natural extension of this work is to compute the bits of tt at higher positions, say the 
ten quadrillionth bit position, or even the quintillionth bit position. Besides, it is interesting 
to compute all the first n digits of tt with Hadoop clusters. Such computation task is not 
only CPU-intensive but also data-intensive. 
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